ABSTRACT

In this paper we propose a novel approach to multi-action
recognition that performs joint segmentation and classification.
The approach models each action with a Gaussian mixture over
robust low-dimensional action features. Segmentation is achieved
by classifying overlapping temporal windows, which are then merged
to produce the final result. The approach is considerably less
complicated than previous methods, which use dynamic programming
or computationally expensive hidden Markov models (HMMs). Initial
experiments on a stitched version of the KTH dataset show that the
proposed approach achieves an accuracy of 78.3%, outperforming a
recent HMM-based approach which obtained 71.2%.

General Terms 

Algorithms, Performance, Experimentation 

Keywords 

human action recognition, multi-action recognition, segmentation,
stochastic modelling, Gaussian mixture models


1. INTRODUCTION 

Action recognition in real-world surveillance videos has
become of great interest due to its potential applications
in daily life situations such as smart homes, home nursing,
building security, and annotation of human actions in video
using minimal manual supervision [8].

Human action recognition can be divided into two areas:
(i) single action recognition, and (ii) multi-action recognition.
In the context of this paper we define single action recognition
as recognising one action performed by a person, and multi-action
recognition as recognising a sequence of such single actions
performed by one person [23].

In most computer vision literature, action recognition approaches
have concentrated on recognising a single action performed by a
person, instead of continuous human actions or multi-actions [5].
Multi-action recognition is a fundamental problem in human action
understanding. When observing videos of human behaviour, an ongoing
problem for computer vision algorithms is recognising and/or
segmenting individual and significant actions from within the
duration of the motion sequence [5]. It is challenging due to
the high variability of appearances, shapes, possible occlusions,
large variability in the temporal scale and periodicity of human
actions, the complexity of articulated motion, the exponential
nature of all possible movement combinations, as well as the
prevalence of irrelevant background [14, 23].

Limited work has been conducted on multi-action recognition.
A joint action segmentation and classification method called
Multi-Task Conditional Random Field (MT-CRF) was presented in [24].
It classifies motions into multiple labels, such as a person
folding their arms while seated. However, this approach has only
been applied to synthetic datasets. Hoai et al. [14] address joint
segmentation and classification by classifying temporal regions
using a multi-class SVM and performing segmentation using dynamic
programming. More recently, Borzeshi et al. [4] proposed the
use of hidden Markov models with irregular observations
(termed HMM-MIO) to perform multi-action recognition.
A drawback of both [4] and [14] is that they have a large
number of parameters to optimise. Furthermore, [4] uses
very high dimensional feature vectors and [14] requires fully
labelled annotations for training.

Reliable feature description is a crucial stage for the
success of an action recognition system. One popular descriptor
for the action recognition task is Spatio-Temporal Interest
Points (STIPs) [15]. STIP based descriptors have some drawbacks
[6, 7, 10, 16]: (i) they focus on local spatio-temporal
information instead of global motion, (ii) they can be unstable
and imprecise (varying number of STIP detections), leading to low
repeatability, (iii) they are computationally expensive, and
(iv) they produce sparse detections. See Fig. 1 for a
demonstration of STIP based detection.

Other feature extraction techniques used for action recognition
include gradients [18] and optical flow [1, 10]. Each pixel in
the gradient image helps extract relevant information, e.g. edges.
Since the task of action recognition is based on a sequence of
frames, optical flow provides an efficient way of capturing the
local dynamics in the scene [10]. See the right side of Fig. 1
for a demonstration of gradient based detection.



In this paper we propose a novel framework to perform
multi-action recognition that requires few parameters. We
model each action as a Gaussian mixture model (GMM)
using the action features described in [21], and use videos of
single actions as training data. We use robust low-dimensional
action features which incorporate optical flow and image gradient
information, and we only consider features with high spatial
frequency (which correspond to edges) to be part of an action.
In this way we only use relevant action information. Segmentation
of actions is achieved by applying the GMM classifier over a
temporal sliding window in an overlapping manner, which
allows us to better deal with temporal misalignment and can
lead to improved performance [12].

Contributions. We propose a more efficient system that
requires fewer parameters to be optimised, and simultaneously
segments and recognises multi-actions. This is in contrast to
the methods presented in [4, 14], which have a larger
parameter search space. Furthermore, we avoid the need
for a custom dynamic programming definition. Lastly, the
feature descriptors used in this work are more robust and
have smaller dimensionality. The proposed method is able
to handle a varying number of feature vectors obtained from
each frame, allowing selective feature extraction from the
most useful image areas.

We continue the paper as follows. In Section 2, we summarise
prior work on feature descriptors, single action recognition and
multi-action recognition. We then describe the proposed
method for joint multi-action segmentation and recognition
in Section 3. In Section 4, we present experiments which
show that the proposed method outperforms existing methods.
Section 5 summarises the main findings and provides
potential areas for future work.


Figure 1: Left: Spatio-Temporal Interest Points descriptors
(STIPs) are unstable, imprecise and overly sparse. Right:
interest pixels (marked in red) obtained using magnitude of
gradient.


2. RELATED WORK 

In this section, we present an overview of the state of the
art in action recognition. To begin with, we describe some
popular descriptors used for action recognition. This is
followed by a review of techniques for single action recognition.
Finally, we present an overview of the approaches which perform
multi-action segmentation and recognition.

2.1 Descriptors for Action Recognition 

Several descriptors have been used to represent human
actions. Among them, we can find descriptors based on
Spatio-Temporal Interest Points (STIPs) [15], gradients [18],
and optical flow [1]. The STIP based detector finds interest
points and is a space-time extension of the Harris and
Förstner interest point detector [9, 13]. Recently, STIP
descriptors have led to relatively good action recognition
performance [28]. However, STIP based descriptors have some
drawbacks, as reported in [6, 7, 10, 16]. They focus on local
spatio-temporal information instead of global motion, they can
be unstable and imprecise (varying number of STIP detections),
leading to low repeatability, redundancy can occur, they are
computationally expensive, and the detections can be overly
sparse (see Fig. 1).

Gradients have been used as a robust image and video 
representation [18]. Each pixel in the gradient image helps 
extract relevant information, e.g. edges. Gradients can be 
computed at every spatio-temporal location (x, y, t) in any 
direction in a video. 

Since action recognition relies on a sequence of frames in
order to analyse the various motion patterns in the video [18],
optical flow provides an efficient way of capturing the local
dynamics in the scene [10]. Optical flow describes the motion
dynamics of an action by calculating the absolute motion between
two frames, which contains motion from many sources [1, 12, 26].

2.2 Single Action Recognition 

Hidden Markov models (HMMs) have a long history of use 
in activity recognition. One action is a sequence of events 
ordered in space and time, and HMMs capture structural 
and transitional features and therefore the dynamics of the 
system [17]. Gaussian Mixture Models (GMMs) have also 
been explored for recognising single actions. In [6], each 
set of feature vectors is modelled with a GMM. Then, the 
likelihood of a feature vector belonging to a given human 
action can be estimated. 

Other successful methods for single action recognition include
Riemannian manifold based approaches and those that use dense
trajectory tracking. For single action recognition, Riemannian
manifolds have been investigated in [11, 21]. In [11] optical
flow features are extracted and then a compact covariance matrix
representation of such features is calculated. Such a covariance
matrix can be thought of as a point on a Riemannian manifold.
An action and gesture recognition method based on spatio-temporal
covariance descriptors obtained from optical flow and gradient
descriptors is presented in [21].

Two recent approaches for single action recognition based
on tracking of interest points make use of dense trajectories
[26, 27]. These dense trajectories allow the description of
videos by sampling dense points from each frame and then
tracking them based on displacement information from a dense
optical flow field. Although this approach obtains good
performance, it is computationally expensive, especially the
calculation of the dense optical flow, which is calculated at
several scales.

2.3 Multi-Action Recognition 

Multi-action recognition, in our context, consists of segmenting
and recognising single actions from a video sequence
where one person performs a sequence of such actions [23].
The process for segmenting and recognising multiple actions
in a video can be solved either as two independent problems
or as a joint problem.

One of the first methods for multi-action recognition,
Multi-Task Conditional Random Field (MT-CRF), was proposed
in [24]. This method consists of classifying motions
into multi-labels, e.g. a person folding their arms while sitting.
Despite this approach being presented as robust, it
has only been applied to two synthetic datasets. Two more
recent methods [4, 14] have been applied to more realistic
multi-action datasets. Hoai et al. [14] deal with the dual
problem of human action segmentation and classification.
This approach is presented as a learning framework that
simultaneously performs temporal segmentation and event
recognition in time series. The supervised training is done
via a multi-class SVM, where the SVM weight vectors have to
be learnt, as well as the other SVM parameters. For the
segmentation, the learnt weight vectors are used and another set
of parameters is optimised (number of segments, and minimum
and maximum lengths of segments). The segmentation
is done using dynamic programming. The feature mapping
depends on the dataset employed, and includes trajectories,
features extracted from binary masks and STIPs.

Although the method proposed in [14] is promising, it
has several drawbacks. One drawback is the requirement
of fully labelled annotation for training. Furthermore, it
suffers from the limitations of dynamic programming, where
writing the code that evaluates sub-problems in the most
efficient order is often nontrivial [25]. Also, the binary masks
are not always available and the STIP descriptors present
deficiencies as mentioned in Section 2.1. This method also
requires an extensive search for optimal parameters.

An approach termed Hidden Markov Model for Multiple,
Irregular Observations (HMM-MIO) [4] has also been
proposed for the multi-action recognition task. HMM-MIO
jointly segments and classifies observations which are irregular
in time and space, and characterised by high dimensionality.
The high dimensionality is reduced by using PPCA
(probabilistic Principal Component Analysis). Moreover,
HMM-MIO deals with heavy tails and outliers exhibited by
empirical distributions by modelling the observation densities
with a long-tailed distribution, the Student's t. HMM-MIO
requires the search of the following 4 optimal parameters:
(i) the resulting reduced dimension ($R$), (ii) the number
of components in each observation mixture ($M$), (iii) the
degree of the t-distribution ($\nu$), and (iv) the number of
cells (or regions) used to deal with the space irregularity ($S$).
To this end, for multiple observations of a frame, they postulate:

$$ p(o_t^{1:N_t} \mid y_t) = \begin{cases} \prod_{n=1}^{N_t} p(o_t^n \mid y_t) & \text{if } N_t \geq 1 \\ 1 & \text{if } N_t = 0 \end{cases} \tag{1} $$

where each observation consists of the pair $o_t^n = \{d_t^n, s_t^n\}$
of the descriptor $d_t^n$ and the cell index $s_t^n$ where it occurs.
The frame index, the observation index, the total number of
observations, and the hidden states are given by $t$, $n$, $N_t$, and
$y_t$, respectively. As feature descriptors, HMM-MIO extracts
STIPs proposed in [15], with the default 162-dimensional
descriptor. The classification is carried out on a per frame
basis. HMM-MIO also suffers from the drawbacks of a large
search for optimal parameters and the use of STIP descriptors.


3. PROPOSED METHOD 
3.1 Selective Feature Extraction 

Descriptors based on optical flow and gradient have
proven to be reliable for the action recognition task [12, 21].
Following [21], we extract the following features for each
pixel:

$$ f(x, y, t) := [\, x, \; y, \; g, \; o \,] \tag{2} $$

where $x$ and $y$ are the pixel coordinates, while $g$ and $o$ are
defined as:
The first four gradient-based features in Eq. (3) represent
the first and second order intensity gradients at pixel
location (x, y). The last two gradient features represent
gradient magnitude and gradient orientation. The optical-flow
based features in Eq. (4) represent, in order: the horizontal
and vertical components of the flow vector, their first order
derivatives with respect to t, and the divergence and vorticity of
the optical flow, proposed in the context of action recognition
in [1]. We obtain a 14-dimensional feature vector per pixel.
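
As a sketch of what the description above implies, and not necessarily the exact form of Eqs. (3) and (4), the two feature groups could be written as follows. The symbols $J$ (image intensity, with subscripts denoting partial derivatives) and $(u, v)$ (horizontal and vertical optical flow) are assumptions introduced only for this illustration.

```latex
g = \left[\; |J_x|, \;\; |J_y|, \;\; |J_{xx}|, \;\; |J_{yy}|, \;\;
      \sqrt{J_x^2 + J_y^2}, \;\; \operatorname{atan}\frac{|J_y|}{|J_x|} \;\right]

o = \left[\; u, \;\; v, \;\;
      \frac{\partial u}{\partial t}, \;\; \frac{\partial v}{\partial t}, \;\;
      \left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} \right), \;\;
      \left( \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} \right) \;\right]
```

Under this reading, the dimensionality works out as 2 (coordinates) + 6 (gradient features) + 6 (optical flow features) = 14.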

Since not all pixels correspond to the object of interest,
we are only interested in pixels with a gradient magnitude
greater than a threshold $\tau$ [12]. As such, we discard all
the feature vectors from locations with a small magnitude.
See the right side of Fig. 1 for an example.
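
The selection step can be illustrated with a minimal Python sketch. The central-difference gradient and the helper names `gradient_magnitude` and `select_pixels` are assumptions made for illustration; the text does not specify how the gradient is computed.

```python
import math

def gradient_magnitude(img, x, y):
    """Gradient magnitude at an interior pixel, using central differences
    (an assumed choice of derivative filter)."""
    jx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    jy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return math.hypot(jx, jy)

def select_pixels(img, tau):
    """Return (x, y) for interior pixels whose gradient magnitude exceeds tau."""
    height, width = len(img), len(img[0])
    return [(x, y)
            for y in range(1, height - 1)
            for x in range(1, width - 1)
            if gradient_magnitude(img, x, y) > tau]

# Toy 5x5 image with a vertical step edge between columns 1 and 2.
img = [[0, 0, 100, 100, 100] for _ in range(5)]
kept = select_pixels(img, tau=40)
```

Only the pixels next to the step edge survive the threshold; feature vectors would then be computed at those locations only.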


3.2 Learning Action Models 

A Gaussian Mixture Model (GMM) is a weighted sum of
$N_g$ component Gaussian densities [3], defined as:

$$ p(x \mid \lambda) = \sum_{g=1}^{N_g} w_g \, \mathcal{N}(x \mid \mu_g, \Sigma_g) \tag{5} $$

where $x$ is a $D$-dimensional continuous-valued data vector
(i.e. measurement or features), $w_g$ is the weight of the $g$-th
Gaussian (with constraints $0 \leq w_g \leq 1$ and $\sum_{g=1}^{N_g} w_g = 1$),
and $\mathcal{N}(x \mid \mu_g, \Sigma_g)$ is the component Gaussian density with
mean $\mu$ and covariance matrix $\Sigma$, given by:

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\} $$

The complete Gaussian mixture model is parametrised by
the mean vectors, covariance matrices and weights of all
component densities. These parameters are collectively
represented by the notation $\lambda = \{w_g, \mu_g, \Sigma_g\}_{g=1}^{N_g}$.
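
A minimal Python sketch of evaluating the log of Eq. (5) is given below, restricted to diagonal covariance matrices for brevity. The function names and the log-sum-exp formulation are illustrative assumptions, not a description of the implementation used in this work; strictly positive weights are assumed so that their logarithm is defined.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of a Gaussian density with diagonal covariance (var holds the diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) for a diagonal-covariance GMM, via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Two identical unit-variance components reduce to a single standard normal.
value = gmm_log_density([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

Working in the log domain avoids underflow when the density is evaluated for many feature vectors.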

Each video $V_i$ is represented as a set of frames
$\{I_t\}_{t=1}^{T}$ and a set of corresponding action labels
$\{l_t\}_{t=1}^{T}$. Each frame is represented as a set of $k$
feature vectors $F_t = \{f_t^1, \dots, f_t^k\}$, where $k$ can vary
depending on frame content.








All the feature vectors belonging to the same action are
pooled together. For each action a GMM is trained with
$N_g$ components. This results in a set of GMM models that
we will express as $\{\lambda_a\}_{a=1}^{A}$, where $A$ is the total
number of actions.
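
The pooling step can be sketched in Python as follows. The data layout (a list of per-frame feature lists paired with per-frame labels) and the name `pool_by_action` are assumptions for illustration; GMM training itself is not shown.

```python
from collections import defaultdict

def pool_by_action(videos):
    """Pool feature vectors by action label.

    videos: iterable of (frames, labels) pairs, where frames[t] is the list of
    feature vectors of frame t and labels[t] is its action label.
    """
    pooled = defaultdict(list)
    for frames, labels in videos:
        for feats, label in zip(frames, labels):
            pooled[label].extend(feats)
    return dict(pooled)

# Toy data: one "walk" video with two frames, one "box" video with one frame.
videos = [
    ([[[1.0]], [[2.0], [3.0]]], ["walk", "walk"]),
    ([[[4.0]]], ["box"]),
]
pooled = pool_by_action(videos)
```

Each entry of `pooled` would then serve as the training set for one action model.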

3.3 Recognition of Multiple Actions 

For each frame $I_t$ in a given testing video $V_i$, we calculate
the feature vectors $f(x, y, t)$ following the same procedure as
explained in Section 3.1. Each frame is then represented as
a set of $k$ feature vectors $F_t = \{f_t^1, \dots, f_t^k\}$. We break
$V_i$ into small overlapping segments and classify each of these
segments.

As reported in [12], many human actions are repetitive in
nature, such as walking and boxing. An important consideration
is to determine the duration of each segment $L$ such
that it is long enough to contain one complete cycle of the
action.

Let $X^s = [F_t \; F_{t+1} \; \cdots \; F_{t+L}]$ be the sequence of $N$ feature
vectors taken from segment $s_{[t, t+L]}$. The feature vectors
are assumed independent, so the average log-likelihood of a
model $\lambda_a$ is computed as:

$$ \log p(X^s \mid \lambda_a) = \frac{1}{N} \sum_{n=1}^{N} \log p(x_n \mid \lambda_a) \tag{6} $$

In each segment $s$, we compute the average log-likelihood
for each model $\lambda_a$ by using Eq. (6). A log-likelihood vector
$\mathbf{p}_s$ is obtained:

$$ \mathbf{p}_s = [\, \log p(X^s \mid \lambda_1), \; \cdots, \; \log p(X^s \mid \lambda_A) \,] \tag{7} $$

Then, a new vector of log-likelihoods is calculated in
the next segment, jumping one frame at a time. We repeat
this process until we reach the end of the video, resulting in
a set of $S$ log-likelihood vectors $\{\mathbf{p}_s\}_{s=1}^{S}$.
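
The sliding-window scoring can be sketched in Python as below. Here each model is any callable returning a log-density for one feature vector, windows hold `L` consecutive frames, and windows without feature vectors receive a zero vector; all three choices, along with the function name, are assumptions made for illustration.

```python
def segment_log_likelihoods(frames, models, L):
    """Average log-likelihood vector for each window of L frames, stride 1.

    frames[t] is the list of feature vectors of frame t; models is a list of
    callables mapping a feature vector to log p(x | lambda_a).
    """
    scores = []
    for start in range(len(frames) - L + 1):
        xs = [x for frame in frames[start:start + L] for x in frame]
        if xs:
            p_s = [sum(m(x) for x in xs) / len(xs) for m in models]
        else:
            p_s = [0.0] * len(models)  # assumed handling of empty windows
        scores.append(p_s)
    return scores

# Toy one-dimensional example with two stand-in "models" (not real GMMs).
models = [lambda x: -(x[0] - 0.0) ** 2, lambda x: -(x[0] - 1.0) ** 2]
frames = [[[0.0]], [[0.0]], [[1.0]], [[1.0]]]
scores = segment_log_likelihoods(frames, models, L=2)
```

The middle window straddles the change of action, so both models score it equally.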

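To calculate the total contribution $\mathbf{p}_t^{\text{total}}$ of the
log-likelihood vectors to each frame $t$, we sum all the $\mathbf{p}_s$
obtained over this frame:

$$ \mathbf{p}_t^{\text{total}} = \sum_{s \,:\, t \in s} \mathbf{p}_s \tag{8} $$

The estimated label $\hat{l}_t$ is calculated as:

$$ \hat{l}_t = \arg\max_{a = 1, \dots, A} \; \mathbf{p}_t^{\text{total}}[a] \tag{9} $$

where $\mathbf{p}_t^{\text{total}}[a]$ denotes the entry corresponding to model $\lambda_a$.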
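
A minimal Python sketch of this per-frame accumulation and labelling follows. It assumes that window `s` starts at frame `s` and covers `L` consecutive frames; the function name `label_frames` and the toy scores are illustrative only.

```python
def label_frames(segment_scores, num_frames, L):
    """Sum the log-likelihood vectors of all windows covering each frame,
    then pick the action index with the largest total."""
    num_actions = len(segment_scores[0])
    totals = [[0.0] * num_actions for _ in range(num_frames)]
    for start, p_s in enumerate(segment_scores):
        for t in range(start, min(start + L, num_frames)):
            for a in range(num_actions):
                totals[t][a] += p_s[a]
    return [max(range(num_actions), key=lambda a: row[a]) for row in totals]

# Three windows of length 2 over a 4-frame video, two candidate actions.
toy_scores = [[0.0, -1.0], [-0.5, -0.5], [-1.0, 0.0]]
labels = label_frames(toy_scores, num_frames=4, L=2)
```

Because every frame is covered by several windows, an ambiguous window (such as the middle one here) is outvoted by its neighbours.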

We aim to segment the video into separate actions. To this
end, we examine the sequence of estimated labels. Knowing
that the duration of each segment is $L$, if any run of identical
estimated labels is shorter than $L$, it is considered a
transition boundary between two actions and is therefore
merged into the preceding segment.
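
The merging rule can be sketched in Python as follows. Two details are assumptions not stated in the text: a short run at the very start of the video is kept as is (it has no preceding segment), and runs with the same label that become adjacent after merging are joined.

```python
from itertools import groupby

def merge_short_runs(labels, L):
    """Merge every run of identical labels shorter than L into the preceding run."""
    merged = []  # list of [label, run_length]
    for label, group in groupby(labels):
        n = len(list(group))
        if merged and n < L:
            merged[-1][1] += n      # absorb short run into preceding segment
        elif merged and merged[-1][0] == label:
            merged[-1][1] += n      # rejoin runs separated only by a short run
        else:
            merged.append([label, n])
    return [label for label, n in merged for _ in range(n)]

smoothed = merge_short_runs([0] * 5 + [1] * 2 + [0] * 1 + [2] * 5, L=3)
```

In the example, the two short runs between the first and last action are absorbed into the first action.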

4. EXPERIMENTS 

We evaluated our proposed method for joint action segmentation
and classification on a stitched version of the
KTH dataset [22]. The KTH dataset is composed of 6 types
of human actions (walking, jogging, running, boxing, hand
waving and hand clapping) which are performed several
times by 25 subjects in 4 scenarios: outdoors, outdoors with
scale variation, outdoors with varying clothes, and indoors.
The image size is 160 × 120 pixels, and the temporal resolution
is 25 frames per second. In total there are 25 × 6 × 4 = 600
video files. Each video contains an individual performing the
same action. See Fig. 2 for an example. This action is
performed 4 times and each subdivision or action-instance (in
terms of start-frame and end-frame) is provided as part of
the dataset¹. This dataset contains 2391 action-instances,
with a length between 1 and 14 seconds [2].

Figure 2: The “boxing” action in the KTH dataset.

Figure 3: Multi-action sequence in the stitched version
of the KTH dataset: hand clapping, walking,
boxing and running.

Following [4], the stitched dataset was obtained by simply
concatenating existing single-action instances into sequences.
The action-instances were picked randomly, alternating
between the two groups of {boxing, hand-waving,
hand-clapping} and {walking, jogging, running} to
accentuate action boundaries. We refer to each of these
sequences as a “multi-action video”. See Fig. 3 for an example.
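
The alternation between the two groups can be illustrated with a short Python sketch. It only generates a sequence of action labels; choosing the actual action-instances, the random starting group, and the names `STATIC`, `MOVING` and `stitch_labels` are assumptions for illustration.

```python
import random

STATIC = ["boxing", "hand-waving", "hand-clapping"]
MOVING = ["walking", "jogging", "running"]

def stitch_labels(num_instances, rng):
    """Pick action labels at random, alternating between the two groups."""
    groups = [STATIC, MOVING]
    first = rng.randrange(2)  # assumed: the starting group is chosen at random
    return [rng.choice(groups[(first + i) % 2]) for i in range(num_instances)]

sequence = stitch_labels(4, random.Random(0))
```

Consecutive entries of `sequence` always come from different groups, so every boundary joins a stationary action and a locomotion action.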

The dataset was divided into a training set of 64 multi-action
videos and a testing set of 36 multi-action videos. For the
experiment, we used 3-fold cross-validation. We use the original
KTH dataset to train our GMMs; that is, only videos
containing single actions are used to train the GMM for each
action. As the KTH dataset was collected in 4 scenarios, we
train a model per action and per scenario. Each pixel descriptor
is a 14-dimensional feature vector extracted using
Eq. (2). All the feature vectors belonging to a frame $t$ are
given by $F_t$. Although the optical flow is calculated in all
frames, in order to speed up processing, we only use the
feature vectors extracted from every second frame.

For our experiments, we used diagonal covariance matrices.
GMM parameters were estimated using descriptors obtained
from training videos using the iterative
Expectation-Maximisation (EM) algorithm [3]. The experiments were
implemented with the help of the Armadillo C++ library [20].
The threshold $\tau$ (used for selecting feature vectors)
was empirically set to 40. The duration of each segment
$L$ was set to 25 frames (1 second), which is the minimum
length of an action-instance in this dataset [2].

An initial set of experiments was performed to find the optimal
number of components $N_g$. For each value
$N_g \in \{16, 64, 256, 1024\}$, we evaluate the performance
on one fold. The results are reported as frame-level
accuracy (%) in Table 1.
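
For concreteness, frame-level accuracy is read here as the percentage of frames whose estimated label matches the ground-truth label; this reading and the function name are assumptions, sketched in Python as:

```python
def frame_accuracy(predicted, truth):
    """Percentage of frames whose predicted label equals the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("label sequences must have the same length")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(truth)

accuracy = frame_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```

Under this measure a mislabelled frame near an action boundary costs the same as any other mislabelled frame.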

We found that using $N_g = 1024$ gave the best performance
(77.0%). The better performance attained with
1024 components is explained by the fact that GMMs with
a large number of components are known to have the ability
to model any given probability distribution function [6]. We
kept the number of Gaussians constant for the remainder of
our experiments.

¹ http://www.nada.kth.se/cvap/actions/





Table 1: Comparison of one run of testing for several
numbers of Gaussians ($N_g$).

  $N_g$     Accuracy (%)
  16        71.1
  64        73.2
  256       75.3
  1024      77.0


Table 2: Comparison of the proposed method to the
state of the art.

  Method                 Accuracy (%)
  Bag-of-features [4]    61.8
  HMM-MIO [4]            71.2
  Proposed method        78.3 ± 2.6


We compared our proposed method with the state-of-the-art
HMM-MIO method [4]. As mentioned in Section 2,
HMM-MIO requires the search of many optimal parameters.
This results in a complex parameter search space (4 free
parameters). In contrast, our proposed method only has 2
free parameters ($\tau$ and $N_g$).

The comparative results of the proposed method,
HMM-MIO, and the baseline Bag-of-Features approach
trained with k-means using 256 clusters are shown in
Table 2. The proposed method obtains the highest accuracy
(78.3 ± 2.6%). We report the results for the proposed method
in terms of average accuracy and standard deviation over the
3 folds.

5. CONCLUSIONS AND FUTURE WORK 

In this paper, we have proposed an improved approach for
joint multi-action segmentation and recognition from video
sequences. We proposed the use of low-dimensional gradient
and optical flow based descriptors which do not suffer
from the instability, imprecision and sparsity exhibited by
the STIP descriptors used by the HMM-MIO system of [4]. The
proposed approach also obviates the need for an extra
dimensionality reduction step, as is the case for the HMM-MIO
system. Furthermore, the proposed system provides
a simpler framework with half the number of parameters
to optimise. Initial experiments on a stitched version of
the KTH dataset show that the proposed approach achieves
an accuracy of 78.3%, compared to 71.2% achieved by the
HMM-MIO system.

Possible directions for future work are: (i) exploring the use
of spatio-temporal descriptors, dividing each frame into cells
or regions to deal with spatial irregularities, and (ii) separating
irrelevant motion from action motion via explicit foreground
segmentation [19], which would be especially useful
when dealing with actions in uncontrolled settings.